I initially started this project as a web-scraping exercise. Part one covered scraping the initial mass shooting data, which was very straightforward, and I learned a great deal about how to accomplish that process in R.
However, as things progressed and I wanted to mine more of the incident data, I discovered that the GVA website uses a bot-detection algorithm to verify that a visitor is a human and not a script. There were several helpful directives online about using Selenium as a headless browser, which I attempted with RSelenium. The website still recognized that I was using a controlled browser, and it denied my access.
Eventually, I discovered that I needed to adjust some of the Chrome options, which I was not able to do in R. Based on the many solutions I found online, I knew I needed to switch to Python. I did just that, and it was a solid learning experience: I have been working on enhancing my Python skills, and this was a practical avenue to move forward with that objective.
This took a few weeks of research, and a lot of late nights and weekends went into the effort. In all, it was worth it, but at times I did question pursuing this given the level of effort required just to scrape these pages. I really wanted the best latitude and longitude information, and an earlier attempt to geocode from the street addresses fell short because many of the addresses are missing information. GVA seems to have better coordinate data behind the scenes.
Now I’ve got what I need to create some nice maps. Let’s go!
library(tidyverse)
library(tidymodels)
library(tidytext)
library(lubridate)
library(scales)
library(janitor)
library(skimr)
library(tidyquant)
library(broom)
library(purrr)
library(tidycensus)
library(leaflet)
library(usmap)
library(reticulate)
api_key <- Sys.getenv("CENSUS_API_KEY")
Read in the saved data file.
complete_tbl <- read_rds("gva_data.rds")
Here I prepare the primary data set to include some date-related fields for convenience when plotting and summarizing the data.
ms_tbl <- complete_tbl %>%
janitor::clean_names() %>%
select(-operations) %>%
mutate(incident_date = mdy(incident_date),
year = year(incident_date),
month = month(incident_date, label = TRUE, abbr = TRUE),
incident_month = rollforward(incident_date),
dow = wday(incident_date, label = TRUE)) %>%
arrange(incident_date)
Pull the incident IDs from the GVA data; these become part of the URL for the incident-specific pages. The original pull covered all incidents, but now that the historical data are gathered, I only need to update with new incidents.
incident_ids <- ms_tbl %>%
distinct(incident_id) # get a list of all of the incident IDs
geo_locations_incidents_tbl <- incident_ids %>%
mutate(pages = str_glue("https://www.gunviolencearchive.org/incident/{incident_id}")) # create a tibble with all of the URLs needed
Here we only need to evaluate the new incidents and then look up their geolocations.
ms_tbl_updated_geo_locs <- read_rds("ms_tbl_updated_geo_locs.rds")
new_incident_ids <- ms_tbl %>%
anti_join(ms_tbl_updated_geo_locs, by = "incident_id") %>%
distinct(incident_id)
new_geo_locations_incidents_tbl <- new_incident_ids %>%
mutate(pages = str_glue("https://www.gunviolencearchive.org/incident/{incident_id}")) # create a tibble with all of the URLs needed
Now we can scrape using Python, which gives us much more flexibility than RSelenium (at least as far as I was able to discern).
# run these in terminal
# pip install selenium
# pip install chromedriver
# pip install beautifulsoup4
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
from time import sleep
import pandas as pd
options = Options() # these options hide the fact that the browser is being controlled by automation rather than a human
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_argument("--profile-directory=Default")
options.add_experimental_option("useAutomationExtension", False)
driver = webdriver.Chrome(options=options) # this establishes the browser session
first_page = "https://www.gunviolencearchive.org/incident/882422" # this is a default page to start with
driver.get(first_page) # this opens the URL in the established browser
# from here, I had to manually click on a second tab and open another page on the same website, which cleared the bot limiter I was running into
# then I was able to run the following loop
url_list = r.new_geo_locations_incidents_tbl # creates a dataframe in python from the R dataframe created earlier
temp_list = {
"geo_data": []
} # a dictionary holding one empty list that will collect the HTML text as we iterate over the pages
for url in url_list["pages"]: # loop to open each page, parse the HTML, extract the text, append it to the list, rinse, and repeat
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
geo_loc = soup.get_text()
temp_list["geo_data"].append(geo_loc)
df = pd.DataFrame.from_dict(temp_list) # this converts the dict of lists into a dataframe
driver.close() # this closes the session.
# note that I was not able to iterate over all 4,000 pages, as the bot detector blocked my VPN IPs a few times
# I broke the chunks down into 200 to 300 at a time, and that seemed to do the trick
# I probably could have used a sleep command to slow the roll, which might have appeased the bot police....
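The chunking and sleep ideas above can be sketched as follows. This is a minimal illustration, not the code I actually ran: `chunked` and `polite_delay` are helper names I am inventing here, the delay bounds are arbitrary, and the `driver.get` call is left commented out so the sketch runs standalone.

```python
import random
import time

def chunked(items, size):
    """Yield successive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def polite_delay(low=2.0, high=6.0):
    """Sleep a random interval to mimic a human browsing pace; returns the delay used."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay

# Example: walk the URLs in batches of 250, pausing between requests.
urls = [f"https://www.gunviolencearchive.org/incident/{i}" for i in range(1000)]
for batch in chunked(urls, 250):
    for url in batch:
        # driver.get(url)   # scrape here...
        # polite_delay()    # ...then wait before the next request
        pass
```

Randomizing the interval (rather than a fixed `sleep`) makes the request pattern look less mechanical, which may be what the bot detector keys on.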
Pull the Python object into R, and back up the scraped data in case R crashes.
geo_data <- py$df # this command moves a Python object into an R object...Reticulate is pretty sweet!
write_rds(geo_data, "scrapped_geo_data.rds") # I stored the data as I went in case R crashed
Update the geo locations file with the new records and save it.
geo_data <- read_rds("scrapped_geo_data.rds")
output <- geo_data %>%
transmute(geo_data = str_extract(geo_data, "Geolocation.*")) # pull the geo location data
ms_tbl_updated <- ms_tbl %>%
anti_join(ms_tbl_updated_geo_locs, by = "incident_id") %>%
bind_cols(output) # combine the original data with the new geo location data
tbl_to_update <- read_rds("ms_tbl_updated_geo_locs.rds") # use this to source a previously stored table and append it
ms_tbl_updated <- tbl_to_update %>%
bind_rows(ms_tbl_updated) # append the new data rows to the old
write_rds(ms_tbl_updated, "ms_tbl_updated_geo_locs.rds") # save it
ms_tbl_updated %>%
summarize(n = n(),
n_dist = n_distinct(incident_id)) # check it
ms_tbl_map_data <- read_rds("ms_tbl_updated_geo_locs.rds")
ms_tbl_map_prep <- ms_tbl_map_data %>%
rename("GeoLocation" = geo_data) %>%
mutate(temp = str_remove_all(GeoLocation, "Geolocation: ")) %>%
separate(temp, into = c("latitude","longitude"), sep = ", ") %>%
mutate(across(latitude:longitude, ~ as.numeric(.)),
color_year = factor(year),
size = number_killed + number_injured,
date_formatted = format.Date(incident_date, format = "%b %d, %Y"),
label = str_glue("{city_or_county}, {state}
{date_formatted}
Injured: {number_injured} | Killed: {number_killed}"))
# List the available map tiles
#names(providers)
# Create a leaflet map with default map tile using addTiles()
ms_tbl_map_prep %>%
leaflet(height = 1000, width = 1000) %>%
addProviderTiles("CartoDB") %>%
addCircleMarkers(label = lapply(str_glue("{month(ms_tbl_map_prep$incident_date, label = TRUE, abbr = TRUE)} {year(ms_tbl_map_prep$incident_date)} <br/> {ms_tbl_map_prep$city_or_county}, {ms_tbl_map_prep$state} <br/> {ms_tbl_map_prep$number_killed} killed | {ms_tbl_map_prep$number_injured} wounded"),
htmltools::HTML), radius = 3, color = "black", clusterOptions = markerClusterOptions()) %>%
addProviderTiles("CartoDB", group = "CartoDB") %>%
addProviderTiles("Esri", group = "Esri") %>%
addLayersControl(baseGroups = c("Esri","CartoDB")) # Use addLayersControl to allow users to toggle between basemaps
worst_shootings <- ms_tbl_map_prep %>%
slice_max(order_by = size, n = 10) %>%
slice(1:10) # slice_max keeps ties, so trim to exactly ten rows
worst_transformed <- usmap_transform(worst_shootings, input_names = c("longitude", "latitude"), output_names = c("x", "y"))
set.seed(1984)
plot_usmap(fill = "goldenrod1", alpha = 0.8) +
geom_point(data = worst_transformed, aes(x = x, y = y, size = size, color = color_year, text = label)) +
ggrepel::geom_label_repel(data = worst_transformed,
aes(x = x, y = y, label = label),
size = 2,
color = "black",
min.segment.length = 0.1,
segment.colour = "blue3",
box.padding = 1,
arrow = arrow(length = unit(0.01, "npc"))) +
scale_color_tq() +
labs(title = str_glue("Map of the Ten Worst Mass Shooting Incidents, {year(min(ms_tbl$incident_date))} to {year(max(ms_tbl$incident_date))}"),
subtitle = "Point size represents the total deaths and injuries per incident") +
theme(legend.position = "none")
library(gganimate)
all_transformed <- usmap_transform(ms_tbl_map_prep, input_names = c("longitude", "latitude"), output_names = c("x", "y")) %>%
filter(state %in% c("Illinois"), city_or_county == "Chicago")
plot_usmap(regions = "county", alpha = 0.8, include = 17031, labels = TRUE) +
geom_density2d(data = all_transformed,
aes(x, y, fill = after_stat(level), alpha = after_stat(level)),
bins = 20) +
stat_density2d(data = all_transformed,
aes(x, y, fill = after_stat(level), alpha = after_stat(level)),
bins = 20,
geom = "density_2d_filled") +
scale_fill_gradient(low = "lightblue", high = "firebrick3") +
labs(title = "Heatmap of Chicago Mass Shooting Incidents by Year",
subtitle = "Year: {closest_state}", # gganimate interpolates closest_state in the label string
x = "",
y = "") +
theme(legend.position = "none") +
transition_states(year)
Mappings with census data?